Machine Learning: AllLife Bank Personal Loan Campaign¶

Problem Statement¶

Context¶

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.

Objective¶

To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segments of customers to target.

Data Dictionary¶

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: #years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home Address ZIP code.
  • Family: the Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Do customers use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

STEP 0 - Domain Knowledge¶

To complete this exploratory analysis and build a model to identify potential customers who have a higher probability of purchasing the loan, this domain knowledge will be helpful:

1) Banking and Finance

2) Marketing and Customer Segmentation

3) Evaluation Metrics

4) Ethics and Compliance


Based on the ZIP codes in the dataset, all customers appear to be located in California. The State of California's official consumer guidance on personal loans (California Department of Financial Protection and Innovation) is therefore relevant:
https://dfpi.ca.gov/consumer-financial-education-other-loans/

Personal Loans

One of the most attractive things about personal loans is that they can be used for any reason. Personal loans may be an option for people who have credit card debt and want to reduce their interest rate by transferring balances. Like other loans, the interest rate and loan terms depend on your credit history and financial situation. The term of a personal loan is generally between 12 and 60 months, the amount can range from as little as $1,000 to $100,000 or more, and the APR may range from 6% to 36%. It is important to consider multiple lenders and negotiate the best terms for your situation.
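The term and APR figures above translate into a monthly payment via the standard amortization formula; a minimal sketch (the loan amount, rate, and term below are hypothetical examples, not values from the dataset):

```python
def monthly_payment(principal, annual_rate, months):
    """Standard amortized-loan monthly payment: P*r / (1 - (1+r)^-n)."""
    r = annual_rate / 12  # monthly interest rate
    return principal * r / (1 - (1 + r) ** -months)

# e.g. a $10,000 personal loan at 12% APR over a 36-month term
payment = monthly_payment(10_000, 0.12, 36)
print(round(payment, 2))  # -> 332.14
```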


Importing necessary libraries¶

In [141]:
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
In [142]:
# Library to suppress warnings or deprecation notes
import warnings

warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data

import pandas as pd
import numpy as np

# Library to split data
from sklearn.model_selection import train_test_split

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# Libraries to build K-Means Clustering
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy import stats

# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models
from sklearn.model_selection import GridSearchCV

# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    make_scorer,
)

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

Loading the dataset¶

In [143]:
# Initial load
data = pd.read_csv("/content/Loan_Modelling.csv")
In [144]:
# Copy data to another variable to preserve original
loan = data.copy()

Data Overview¶

  • Observations
  • Sanity checks
In [145]:
# First 5 rows
loan.head()
Out[145]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [146]:
# Last 5 rows
loan.tail()
Out[146]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1
In [147]:
# Shape of dataset
loan.shape
Out[147]:
(5000, 14)
  • The dataset has 5,000 rows and 14 columns
In [148]:
loan.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB

Insight¶

  • All numeric dtypes

    • All columns are int64, except CCAvg, which is float64.
  • No missing values

  • Small memory usage (about 547 KB)


In [149]:
# Statistical Summary
loan.describe()
Out[149]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.00000 5000.000000 5000.000000
mean 2500.500000 45.338400 20.104600 73.774200 93169.257000 2.396400 1.937938 1.881000 56.498800 0.096000 0.104400 0.06040 0.596800 0.294000
std 1443.520003 11.463166 11.467954 46.033729 1759.455086 1.147663 1.747659 0.839869 101.713802 0.294621 0.305809 0.23825 0.490589 0.455637
min 1.000000 23.000000 -3.000000 8.000000 90005.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
25% 1250.750000 35.000000 10.000000 39.000000 91911.000000 1.000000 0.700000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
50% 2500.500000 45.000000 20.000000 64.000000 93437.000000 2.000000 1.500000 2.000000 0.000000 0.000000 0.000000 0.00000 1.000000 0.000000
75% 3750.250000 55.000000 30.000000 98.000000 94608.000000 3.000000 2.500000 3.000000 101.000000 0.000000 0.000000 0.00000 1.000000 1.000000
max 5000.000000 67.000000 43.000000 224.000000 96651.000000 4.000000 10.000000 3.000000 635.000000 1.000000 1.000000 1.00000 1.000000 1.000000

Insight¶

  • All columns have a count of 5000, meaning no missing values.

  • Age has a mean of 45 and a standard deviation of about 11.5. The minimum age is 23 and the maximum is 67.

  • Experience has a mean of 20 and a standard deviation of about 11.5. The minimum is -3 and the maximum is 43 years. Negative experience is suspicious and will be investigated.

  • Income has a mean of about 74K and a standard deviation of 46K. The values range from 8K to 224K.

  • ZIP codes will be analyzed further.

  • There are 4 unique values in the Family column.

  • CCAvg has a mean of 1.94 and a standard deviation of about 1.75. The values range from 0.0 to 10.0.

  • The Education column has 3 unique values.

  • Mortgage has a mean of 56.5K and a standard deviation of about 102K. The standard deviation is greater than the mean, suggesting a heavily skewed distribution. We will investigate further.

  • Personal_Loan, Securities_Account, CD_Account, Online, and CreditCard are binary flags and will be analyzed further.

In [150]:
# Check null values
loan.isnull().sum()
Out[150]:
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64

No missing values present

In [151]:
# Check for duplicates
loan.duplicated().sum()
Out[151]:
0

No duplicates present


Exploratory Data Analysis.¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?

EDA Functions¶

In [152]:
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None, ascending=False):
    """
    Barplot with count or percentage labels at the top of each bar

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    ascending: whether to order the bars by ascending count (default is False)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts(ascending=ascending).index[:n],
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar's center
        y = p.get_height()  # height of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [153]:
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [154]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    if bins:  # histogram with user-specified bins
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:  # let seaborn choose the bins
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

Univariate Analysis¶

  • Univariate statistics summarize only one variable at a time
In [155]:
# Show first few rows as a reminder
loan.head()
Out[155]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [156]:
# review ID column
loan['ID'].value_counts()
Out[156]:
ID
1       1
3331    1
3338    1
3337    1
3336    1
       ..
1667    1
1666    1
1665    1
1664    1
5000    1
Name: count, Length: 5000, dtype: int64
In [157]:
loan['ID'].value_counts().sum()
Out[157]:
5000

ID column will not be further analyzed as each ID is unique.
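Since each ID is unique, it adds no predictive signal and can be dropped before modeling; a minimal sketch, using a small hypothetical frame standing in for the full `loan` DataFrame:

```python
import pandas as pd

# hypothetical stand-in for the full `loan` DataFrame
loan = pd.DataFrame({"ID": [1, 2, 3], "Income": [49, 34, 11]})

# drop the identifier column; it is unique per row and carries no signal
loan = loan.drop(columns=["ID"])
print(list(loan.columns))  # -> ['Income']
```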

In [158]:
# Review Zipcode column
loan['ZIPCode'].value_counts()
Out[158]:
ZIPCode
94720    169
94305    127
95616    116
90095     71
93106     57
        ... 
96145      1
94087      1
91024      1
93077      1
94598      1
Name: count, Length: 467, dtype: int64
  • ZIPCode will be analyzed further as many ZIP codes recur
  • All ZIP codes appear to be in the state of California

Further Insight

  • Domain knowledge should focus on California's laws, interest rates, and processes for issuing personal loans.

Order of Columns Analyzed¶

1 Age
2 Experience
3 Income
4 ZIPCode
5 Family
6 CCAvg
7 Education
8 Mortgage
9 Personal_Loan
10 Securities_Account
11 CD_Account
12 Online
13 CreditCard

1. Observations on Age¶

In [159]:
# Histplot & Boxplot
histogram_boxplot(loan, "Age")
In [160]:
# Breakdown of Age
labeled_barplot(loan, "Age", perc=True, n=None, ascending = True)

Insight¶

  • No outliers present
  • Average age is 45 years
  • Distribution is roughly uniform

Further insight

  • Age ranges from 23 to 67
  • Ages 35, 43, 52, 54 and 58 have the highest counts (35 and 43 are the highest)
  • Ages 24, 66, 67 and 23 have the lowest counts (67 and 23 are the lowest)

2.Observation on Experience¶

In [161]:
# Histplot & Boxplot
histogram_boxplot(loan, "Experience")
In [162]:
# Breakdown of experience
labeled_barplot(loan, "Experience", perc=True, n=None, ascending = True)
In [163]:
#Max value
loan['Experience'].max()
Out[163]:
43
In [164]:
#Min value
loan['Experience'].min()
Out[164]:
-3

Insight¶

  • No outliers present
  • Average experience is 20 years
  • Distribution is roughly uniform
  • Experience ranges from -3 to 43
  • The most common experience value is 32 years, with 3.1% of customers
  • The extreme values (-3, -2, -1, 42, 43) have much lower counts

Further insight

  • Why are there negative values for experience? They could be entry errors; this needs further analysis.
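If the negative values turn out to be entry errors, one defensible treatment is to clip them to zero (taking the absolute value is another option); a minimal sketch on a hypothetical stand-in for `loan`:

```python
import pandas as pd

# hypothetical stand-in for the full `loan` DataFrame
loan = pd.DataFrame({"Experience": [-3, -1, 0, 12, 43]})

# treat negative experience as a data-entry error and clip to zero
loan["Experience"] = loan["Experience"].clip(lower=0)
print(loan["Experience"].tolist())  # -> [0, 0, 0, 12, 43]
```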

3.Observation on Income¶

In [165]:
# Histplot & Boxplot
histogram_boxplot(loan, "Income")
In [166]:
loan['Income'].max()
Out[166]:
224
In [167]:
loan['Income'].min()
Out[167]:
8
In [168]:
loan['Income'].median()
Out[168]:
64.0

Insight¶

  • Many outliers present
  • Median income is 64k
  • Right skewed
  • Income ranges from 8K to 224K
  • Highest peak is around 40k to 50k and second peak is 70k to 80k

4.Observation on Zipcode¶

In [169]:
# Check recurring zip codes
loan['ZIPCode'].nunique()
Out[169]:
467
In [170]:
# Highest count for zip codes
loan['ZIPCode'].value_counts()
Out[170]:
ZIPCode
94720    169
94305    127
95616    116
90095     71
93106     57
        ... 
96145      1
94087      1
91024      1
93077      1
94598      1
Name: count, Length: 467, dtype: int64
In [171]:
#Count occurrences of each zip code
zipcode_counts = loan['ZIPCode'].value_counts()

#Select the top 5 zip codes
top_5_zipcodes = zipcode_counts.head(5)

# Convert to DataFrame for easier plotting
top_5 = top_5_zipcodes.reset_index()
top_5.columns = ['ZIPCode', 'count']

#Plot the histogram
sns.barplot(data=top_5, x='ZIPCode', y='count')

plt.xlabel('Zipcode')
plt.ylabel('Count')
plt.title('Top 5 Zipcodes by Count')
plt.show()

Insight¶

All ZIP codes are for locations within California, with the highest counts for:

  • 94720 (Berkeley, CA): 169
  • 94305 (Stanford, CA): 127
  • 95616 (Davis, CA): 116
  • 90095 (Los Angeles, CA): 71
  • 93106 (Santa Barbara, CA): 57

These could correspond to AllLife Bank branch locations; they are also higher-average-income university areas, which would explain their higher counts.

(Further bivariate analysis on income and ZIP codes will give better insight)
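That planned bivariate look at income and ZIP codes could start like this; a minimal sketch, with a small hypothetical mini-frame standing in for `loan`:

```python
import pandas as pd

# hypothetical mini-frame standing in for the full `loan` DataFrame
loan = pd.DataFrame({
    "ZIPCode": [94720, 94720, 94720, 94305, 94305, 90095],
    "Income":  [120,   80,    100,   150,   110,   60],
})

# median income within the most frequent ZIP codes
top_zips = loan["ZIPCode"].value_counts().head(2).index
income_by_zip = (
    loan[loan["ZIPCode"].isin(top_zips)]
    .groupby("ZIPCode")["Income"]
    .median()
    .sort_values(ascending=False)
)
print(income_by_zip)
```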

5.Observation on Family¶

In [172]:
# Histplot & Boxplot
histogram_boxplot(loan, "Family")
In [173]:
# Barplot with percent
labeled_barplot(loan, "Family", perc=True, n=None, ascending = True)

Insight¶

  • No outliers present
  • Right skewed
  • Family size of 1 has the highest count, 29.4%
  • Family sizes of 2 and 4 have similar counts, 24.4%-26%
  • Family size of 3 has the lowest count, 20.2%

6.Observation on CCAvg¶

In [174]:
# Histplot & Boxplot
histogram_boxplot(loan, "CCAvg")
In [175]:
# Max value
loan['CCAvg'].max()
Out[175]:
10.0

Insight¶

  • Many outliers present
  • Right skewed
  • Most customers' average monthly credit card spending is in the 0-3K range
  • Spending of 4-10K has a much lower count compared to 0-3K
  • The maximum is 10K

7.Observation on Education¶

In [176]:
# Histplot & Boxplot
histogram_boxplot(loan, "Education")
In [177]:
# Barplot
labeled_barplot(loan, 'Education', perc =True)

Insight¶

Undergrad = 1 / Graduate = 2 / Advanced/Professional = 3

  • No outliers present
  • The majority is level 1 (Undergrad), with 41.9%
  • Levels 2 and 3 are similar, at 28%-30% each

8.Observation on Mortgage¶

In [178]:
# Histplot & Boxplot
histogram_boxplot(loan, "Mortgage")
In [179]:
# Total count
loan['Mortgage'].value_counts()
Out[179]:
Mortgage
0      3462
98       17
119      16
89       16
91       16
       ... 
547       1
458       1
505       1
361       1
541       1
Name: count, Length: 347, dtype: int64

Insight¶

  • Many outliers present
  • Right skewed
  • Mortgage values run from 0 to 635K
  • The majority is 0, with 69% (3,462 counts)
  • Among nonzero values, the most common mortgage amounts fall in the 89-119K range, each with counts of 16-17
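Because roughly 69% of Mortgage values are zero, a binary indicator may be more useful for modeling than the raw, heavily skewed amount; a minimal sketch on a hypothetical stand-in for `loan` (the `Has_Mortgage` column name is an assumption):

```python
import pandas as pd

# hypothetical stand-in for the full `loan` DataFrame
loan = pd.DataFrame({"Mortgage": [0, 0, 98, 0, 250]})

# flag whether the customer has any mortgage at all
loan["Has_Mortgage"] = (loan["Mortgage"] > 0).astype(int)
print(loan["Has_Mortgage"].tolist())  # -> [0, 0, 1, 0, 1]
```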

9.Observation on Personal_Loan¶

In [180]:
# Histplot & Boxplot
histogram_boxplot(loan, "Personal_Loan")
In [181]:
# Boxplot
labeled_barplot(loan, 'Personal_Loan', perc =True)

Insight¶

Did this customer accept the personal loan offered in the last campaign?

  • The majority of customers did not accept the personal loan in the last campaign

Further analysis

  • Investigate the characteristics of the 9.6% of customers who accepted the personal loan and identify what distinguishes them.
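One way to profile the 9.6% who accepted is to compare average attribute values across the two groups; a minimal sketch, with hypothetical numbers standing in for `loan`:

```python
import pandas as pd

# hypothetical mini-frame standing in for the full `loan` DataFrame
loan = pd.DataFrame({
    "Personal_Loan": [0, 0, 1, 1],
    "Income":        [50, 70, 140, 160],
    "CCAvg":         [1.0, 2.0, 3.5, 4.5],
})

# mean of each attribute for non-accepters (0) vs accepters (1)
profile = loan.groupby("Personal_Loan")[["Income", "CCAvg"]].mean()
print(profile)
```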

10.Observation on Securities_Account¶

In [182]:
# Histplot & Boxplot
histogram_boxplot(loan, "Securities_Account")
In [183]:
#Boxplot
labeled_barplot(loan, 'Securities_Account', perc =True)

Insight¶

  • The majority (89.6%) do not have a securities account with the bank

  • Personal_Loan and Securities_Account have similar overall yes/no proportions

11.Observation on CD_Account¶

In [184]:
# Histplot & Boxplot
histogram_boxplot(loan, "CD_Account")
In [185]:
# Barplot
labeled_barplot(loan, 'CD_Account', perc =True)

Insight¶

  • The majority of customers (94%) do not have a CD with the bank

  • Only a small minority (6%) do

12.Observation on Online¶

In [186]:
# Histplot & Boxplot
histogram_boxplot(loan, "Online")
In [187]:
# Barplot
labeled_barplot(loan, 'Online', perc =True)

Insight¶

  • The majority of customers (59.7%) use online banking facilities

13.Observation on Credit Card¶

In [188]:
# Histplot & Boxplot
histogram_boxplot(loan, "CreditCard")
In [189]:
# Barplot
labeled_barplot(loan, 'CreditCard', perc =True)

Insight¶

  • 70.6% of customers do not use a credit card issued by another bank.

Bivariate Analysis¶

  • Focus on Personal Loan as it is the main business problem.
In [190]:
# Pull up data for reminder
loan.head()
Out[190]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [191]:
# Heatmap
plt.figure(figsize=(12, 7))
sns.heatmap(loan.corr(), annot=True, cmap="coolwarm");

Insight¶

  • Age and Experience have a strong positive correlation: as one gets older, work experience grows.

  • CCAvg and Income have a moderate positive correlation: the higher the income, the higher the average credit card spending.

  • Personal_Loan and Income have a moderate positive correlation: customers with higher incomes are more likely to take out personal loans.
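EDA question 3 (which attributes correlate most strongly with the target) can be answered directly by ranking the correlations; a minimal sketch, with a hypothetical mini-frame standing in for `loan`:

```python
import pandas as pd

# hypothetical mini-frame standing in for the full `loan` DataFrame
loan = pd.DataFrame({
    "Income":        [50, 70, 140, 160],
    "CCAvg":         [1.0, 2.0, 3.5, 4.5],
    "Personal_Loan": [0, 0, 1, 1],
})

# correlation of every numeric column with the target, strongest first
target_corr = (
    loan.corr(numeric_only=True)["Personal_Loan"]
    .drop("Personal_Loan")
    .sort_values(ascending=False)
)
print(target_corr)
```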


In [192]:
#Plot pairplot
sns.pairplot(data=loan, diag_kind="kde")
plt.show();

Insight¶

  • Age and Experience have a strong positive correlation: the higher the age, the more experience a customer has.

  • Aside from the Income-CCAvg relationship seen in the heatmap, no other strong linear relationships are apparent.

Personal Loan & Family¶

In [193]:
# Countplot - Personal_ loan & Family
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='Personal_Loan', hue='Family')
plt.ylabel('Count')
plt.title('Personal loan & Family')
plt.xticks(rotation=45);
In [194]:
#Figure size
plt.figure(figsize=(10, 6))

#Boxplot
sns.boxplot(data=loan, x='Personal_Loan', y='Family')
plt.ylabel('Family')
plt.title('Personal Loan & Family')
plt.xticks(rotation=45);

Insight¶

Individuals who take personal loans tend to have larger families on average compared to those who do not.

Without personal loan

  • The majority of customers do not have personal loans, with a family size of 1 being the most common among them

With personal loan

  • Families of 3 or 4 take more personal loans compared to those with 1 or 2 family members.

Personal Loan & Education¶

In [195]:
# Checking the distribution
loan.groupby('Personal_Loan')['Education'].describe()
Out[195]:
count mean std min 25% 50% 75% max
Personal_Loan
0 4520.0 1.843584 0.839975 1.0 1.0 2.0 3.0 3.0
1 480.0 2.233333 0.753373 1.0 2.0 2.0 3.0 3.0
In [196]:
# Countplot - Personal Loan & Education
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='Personal_Loan', hue='Education')
plt.ylabel('Count')
plt.title('Personal loan & Education')
plt.xticks(rotation=45);
In [197]:
#Figure size
plt.figure(figsize=(10, 6))

#Boxplot
sns.boxplot(data=loan, x='Personal_Loan', y='Education')
plt.ylabel('Education')
plt.title('Personal Loan & Education')
plt.xticks(rotation=45);

Insight¶

Undergrad = 1 / Graduate = 2 / Advanced/Professional = 3

  • Higher education levels are associated with a greater likelihood of taking personal loans.

Without personal loan

  • Undergrad (level 1) education is the largest group
  • Levels 2 and 3 are similar in count (level 3 slightly higher)

With personal loan

  • Advanced (level 3) has the highest number of personal loans
  • Those with level 2 education have double the count compared to level 1 individuals

Personal & CCAvg¶

In [198]:
# Checking the distribution
loan.groupby('Personal_Loan')['CCAvg'].describe()
Out[198]:
count mean std min 25% 50% 75% max
Personal_Loan
0 4520.0 1.729009 1.567647 0.0 0.6 1.4 2.3000 8.8
1 480.0 3.905354 2.097681 0.0 2.6 3.8 5.3475 10.0
In [199]:
#Figure size
plt.figure(figsize=(12, 6))

# Histogram and KDE plot for customers with personal loans
sns.histplot(loan[loan['Personal_Loan'] == 1]['CCAvg'], kde=True, color='blue', label='Has Personal Loan', bins=30)

# Histogram and KDE plot for customers without personal loans
sns.histplot(loan[loan['Personal_Loan'] == 0]['CCAvg'], kde=True, color='red', label='No Personal Loan', bins=30)

plt.xlabel('CCAvg')
plt.ylabel('Frequency')
plt.title('CCAvg Distribution: Customers With and Without Personal Loans')
plt.legend()
plt.show()

Insight¶

Without personal loan

  • Count of 4,520
  • Mean is 1.7K in monthly spending
  • Right skewed

With personal loan

  • Count of 480
  • Mean is 3.9K in monthly spending

Personal Loan & Age¶

In [200]:
# Checking the distribution of age for customers with and without a personal loan
loan.groupby('Personal_Loan')['Age'].describe()
Out[200]:
count mean std min 25% 50% 75% max
Personal_Loan
0 4520.0 45.367257 11.450427 23.0 35.0 45.0 55.0 67.0
1 480.0 45.066667 11.590964 26.0 35.0 45.0 55.0 65.0
In [201]:
# Figure size
plt.figure(figsize=(12, 6))

# Histogram and KDE plot for customers with personal loans
sns.histplot(loan[loan['Personal_Loan'] == 1]['Age'], kde=True, color='blue', label='Has Personal Loan', bins=30)

# Histogram and KDE plot for customers without personal loans
sns.histplot(loan[loan['Personal_Loan'] == 0]['Age'], kde=True, color='red', label='No Personal Loan', bins=30)

plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution: Customers With and Without Personal Loans')
plt.legend()
plt.show()
In [202]:
# Create age bins: 23 -67 is the age range in the dataset
age_bins = [20, 30, 40, 50, 60, 70]
age_labels = ['20-29', '30-39', '40-49', '50-59', '60-69']


# Create a new column for age ranges
loan['Age_Range'] = pd.cut(loan['Age'], bins=age_bins, labels=age_labels, right=False)

# Plot the data using age ranges
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='Personal_Loan', hue='Age_Range')
plt.ylabel('Count')
plt.title('Personal Loan & Age Range')
plt.xticks(rotation=45)
plt.show()
In [203]:
#Figure size
plt.figure(figsize=(10, 6))

#Boxplot
sns.boxplot(data=loan, x='Personal_Loan', y='Age')
plt.ylabel('Age')
plt.title('Personal Loan & Age')
plt.xticks(rotation=45);

Insight¶

  • No outliers present.

  • Mean age is 45 for both groups.

  • Individuals aged 30 to 59 have the highest counts for personal loans taken (ages 30-39 are the highest).

  • Ages 20-29 have the lowest count for personal loans taken out, followed by the 60-69 range.

  • The boxplot shows the data is distributed similarly for both groups; however, those with personal loans have slightly shorter whiskers on each side.

Further insight

  • We can infer that ages 30-59 are when people have finished studying and settled into their careers, whereas many people aged 20-29 are still students carrying student loans.

  • Ages 60-69 are closer to retirement age, with a focus on retirement-oriented loan options.

Domain knowledge: The Office of Federal Student Aid says California is the state with the most federal student loan debt. https://www.ppic.org/publication/student-loan-debt-in-california/

Personal Loan & Income¶

In [204]:
# Checking the distribution of age for customers with and without a personal loan
loan.groupby('Personal_Loan')['Income'].describe()
Out[204]:
count mean std min 25% 50% 75% max
Personal_Loan
0 4520.0 66.237389 40.578534 8.0 35.0 59.0 84.0 224.0
1 480.0 144.745833 31.584429 60.0 122.0 142.5 172.0 203.0
In [205]:
# Figure size
plt.figure(figsize=(12, 6))

# Histogram and KDE plot for customers with personal loans
sns.histplot(loan[loan['Personal_Loan'] == 1]['Income'], kde=True, color='blue', label='Has Personal Loan', bins=30)

# Histogram and KDE plot for customers without personal loans
sns.histplot(loan[loan['Personal_Loan'] == 0]['Income'], kde=True, color='red', label='No Personal Loan', bins=30)

plt.xlabel('Income')
plt.ylabel('Frequency')
plt.title('Income Distribution: Customers With and Without Personal Loans')
plt.legend()
plt.show()

Insight¶

Individuals with personal loans tend to have a higher income than those without a personal loan.

Without loans

  • The majority of those without personal loans have incomes ranging from 16K to 80K, with outliers on the upper end.
  • Mean income is 66K
  • Total count of customers is 4,520

With loans

  • The majority of those with personal loans have incomes ranging from 60K to 200K.
  • Mean income is 144K
  • Total count of customers is 480
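A natural follow-up is to quantify the conversion rate by income band; a minimal sketch (the band edges are illustrative, and the mini-frame stands in for `loan`):

```python
import pandas as pd

# hypothetical mini-frame standing in for the full `loan` DataFrame
loan = pd.DataFrame({
    "Income":        [30, 55, 90, 130, 170],
    "Personal_Loan": [0,  0,  0,  1,   1],
})

# share of customers in each income band who accepted the loan
bands = pd.cut(loan["Income"], bins=[0, 60, 120, 250], labels=["low", "mid", "high"])
conversion = loan.groupby(bands)["Personal_Loan"].mean()
print(conversion)
```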

Personal Loan & Online¶

In [206]:
# Countplot - Personal Loan & Online
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='Personal_Loan', hue='Online')
plt.ylabel('Count')
plt.title('Personal loan & Online')
plt.xticks(rotation=45);

Insight¶

  • The majority of customers, with or without a personal loan, use online banking facilities, as they are customers of the bank.
  • Univariate analysis showed that about 60% of all customers use online banking facilities. The remaining 40% who do not are split comparably between both groups.

Further insight

  • Analyze what types of banking facilities customers use, for what purpose, and how often, to find a pattern between individuals with and without personal loans.

Personal Loan & Credit Card¶

In [207]:
#Countplot - Personal Loan & Credit Card
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='Personal_Loan', hue='CreditCard')
plt.ylabel('Count')
plt.title('Personal loan & CreditCard')
plt.xticks(rotation=45);

Insight¶

  • Most customers, with or without personal loans, do not hold a credit card from another banking institution.

  • Those without a personal loan have a higher count of credit cards issued by another bank.

  • Univariate analysis showed that about 70% of customers do not use another bank's credit card. The remaining 30% who do are split comparably between both groups.

Further insight

Those without personal loans might hold other credit cards because AllLife Bank may have issued them a credit limit; those individuals then request additional credit cards from other banks.

Personal loan & CD Account¶

In [208]:
# Countplot - Personal Loan & CD Account
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='Personal_Loan', hue='CD_Account')
plt.ylabel('Count')
plt.title('Personal loan & CD_Account')
plt.xticks(rotation=45);

Insight¶

  • Majority of customers do not have a CD_Account.

  • Individuals with a personal loan are more likely to have a CD_Account as well.

Personal Loan & Security Account¶

In [209]:
# Count - Personal Loan & Security Account
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='Personal_Loan', hue='Securities_Account')
plt.ylabel('Count')
plt.title('Personal loan & Securities_Account')
plt.xticks(rotation=45);

Insight¶

CD_Account & Family¶

In [210]:
# Countplot - CD_account & Family
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='CD_Account', hue='Family')
plt.ylabel('Count')
plt.title('CD_Account & Family')
plt.xticks(rotation=45);

Insight¶

  • Univariate analysis shows only 6% of customers have CD accounts, and within that group a family size of 3 is the most common.

CD_Account & Securities¶

In [211]:
# Countplot - CD_account & Securities
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='CD_Account', hue='Securities_Account')
plt.ylabel('Count')
plt.title('CD_Account & Security')
plt.xticks(rotation=45);

Insight¶

  • A higher percentage of people with CD Accounts have Security Accounts compared to those who do not have a CD Account.

CD_Account & Credit Card¶

In [212]:
# Count - CD_account & Credit_Card
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='CD_Account', hue='CreditCard')
plt.ylabel('Count')
plt.title('CD_Account & Credit_Card')
plt.xticks(rotation=45);

Insight¶

  • A higher percentage of customers with CD_Accounts also hold a credit card from another bank, compared to customers without CD_Accounts.

CD_account & Online¶

In [213]:
# Countplot CD_Account & Online
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='CD_Account', hue='Online')
plt.ylabel('Count')
plt.title('CD_Account & Online')
plt.xticks(rotation=45);

Insight¶

  • Similar to the Personal Loan & Online pattern.
  • The majority of people use online facilities regardless of whether they have a CD_Account.

Age and Education¶

In [214]:
# Create age bins: 23 -67 is the age range in the dataset
age_bins = [20, 30, 40, 50, 60, 70]
age_labels = ['20-29', '30-39', '40-49', '50-59', '60-69']


# Create a new column for age ranges
loan['Age_Range'] = pd.cut(loan['Age'], bins=age_bins, labels=age_labels, right=False)

# Plot the data using age ranges
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='Education', hue='Age_Range')
plt.ylabel('Count')
plt.title('Education & Age Range')
plt.xticks(rotation=45)
plt.show()

Insight¶

Undergrad = 1 / Graduate = 2 / Advanced/Professional = 3

Level 1

  • Ages 40-49 have the highest count

Level 2

  • Ages 50-59 and 30-39 have the highest counts

Level 3

  • Ages 50-59 have the highest count

Correlation for Target variable¶

In [215]:
#Import Library
from scipy.stats import chi2_contingency

features = ['Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']
target = 'Personal_Loan'

# Calculate Pearson correlation for continuous variables
corr_matrix = loan[features + [target]].corr()

# Display correlation with the target variable
print("Correlation with Personal Loan:\n", corr_matrix[target].sort_values(ascending=False))

# Visualize the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()

# Calculate and display Chi-square test for categorical variables
def chi_square_test(cat_feature, target_feature):
    contingency_table = pd.crosstab(loan[cat_feature], loan[target_feature])
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    return p

categorical_features = ['Education', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']

chi_square_results = {feature: chi_square_test(feature, target) for feature in categorical_features}

print("Chi-square test p-values:\n", chi_square_results)
Correlation with Personal Loan:
 Personal_Loan         1.000000
Income                0.502462
CCAvg                 0.366889
CD_Account            0.316355
Mortgage              0.142095
Education             0.136722
Family                0.061367
Securities_Account    0.021954
Online                0.006278
CreditCard            0.002802
ZIPCode              -0.002974
Experience           -0.007413
Age                  -0.007726
Name: Personal_Loan, dtype: float64
Chi-square test p-values:
 {'Education': 6.991473868665428e-25, 'Securities_Account': 0.14051497326319357, 'CD_Account': 7.398297503329848e-110, 'Online': 0.6928599643141484, 'CreditCard': 0.8843861223314504}

Insight¶

Continuous Variables:

  • Income has a strong positive correlation with Personal_Loan.
  • CCAvg also shows a moderate positive correlation.
  • Other continuous variables show weak correlations.

Chi-square Test Categorical Variables:

  • Education and CD_Account have significant associations with Personal_Loan (p-values ≤ 0.05).
  • Securities_Account, Online, and CreditCard do not have significant associations (p-values > 0.05).
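The p-values above indicate whether an association exists but not how strong it is. A minimal sketch of Cramér's V, a chi-square-based effect size (0 = no association, 1 = perfect), shown on toy series as stand-ins for columns of `loan` so it runs on its own:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Effect size for the association between two categorical series (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    # correction=False so a perfect 2x2 association yields exactly V = 1
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Toy demo: a perfectly associated pair vs. a perfectly independent one
a = pd.Series([0, 0, 1, 1] * 25)
b = a.copy()                   # identical to a -> V = 1.0
c = pd.Series([0, 1] * 50)     # independent of a -> V = 0.0
print(cramers_v(a, b), cramers_v(a, c))  # 1.0 0.0
```

Applied to the real columns, this would rank CD_Account and Education well above the other categorical features, consistent with the p-values.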

Recap of EDA¶

Understanding the data¶

What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?

  • Many outliers are present and the data is right skewed. Mortgages range from 0 to 635K, and the majority of customers (68%) do not have a mortgage.
  • The highest count of non-zero mortgages falls in the 89K-119K range

How many customers have credit cards?

  • Data shows 70% of customers do not have a credit card with another bank, while 30% do.
  • The data does not specify how many customers have credit cards with AllLife Bank itself.

What are the attributes that have a strong correlation with the target attribute (personal loan)?

  • Data shows Income, CCAvg, Education, and CD_Account having the strongest correlation with the target attribute.

How does a customer's interest in purchasing a loan vary with their age?

  • There doesn't appear to be a clear pattern indicating that customers of a particular age group are more or less likely to purchase a loan. Both groups have a similar distribution of ages, with no significant outliers or trends.

  • Data shows age ranges from 30-59 having more personal loans than those of 20-29 and 60-69


How does a customer's interest in purchasing a loan vary with their education?

  • Overall, the data suggests that there might be a relationship between education level and the likelihood of having a personal loan, with customers having higher education levels being slightly more inclined to purchase a loan.

  • Data shows individuals with graduate or advanced education have higher total personal loan counts than those at other education levels.


Additional EDA K-means Clustering¶

Reason for this approach:

K-means clustering can effectively segment customers based on selected features. This approach provides valuable insights into different customer groups, allowing for targeted marketing and personalized services.

Data Preprocessing¶

K-means clustering is sensitive to outliers, so it is best to mask them before fitting for more accurate results.

In [216]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy import stats
# Define features
features = ['Age', 'Income', 'Family', 'CreditCard', 'Education', 'Personal_Loan', 'CD_Account', 'Mortgage']

# Select the feature
X = loan[features]

# Scale the features using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Calculate Z-scores for each feature
z_scores = np.abs(X_scaled)

# Define threshold for outlier detection (e.g., Z-score > 3)
threshold = 3

# Create a mask to identify outliers
outlier_mask = (z_scores > threshold).any(axis=1)

# Remove outliers from the dataset
X_cleaned = X[~outlier_mask]
In [217]:
summary_stats_cleaned = X_cleaned.describe()
print(summary_stats_cleaned)
               Age       Income       Family   CreditCard    Education  \
count  4300.000000  4300.000000  4300.000000  4300.000000  4300.000000   
mean     45.374651    65.060000     2.384419     0.272326     1.858140   
std      11.453349    39.420602     1.153043     0.445208     0.839181   
min      23.000000     8.000000     1.000000     0.000000     1.000000   
25%      35.000000    35.000000     1.000000     0.000000     1.000000   
50%      45.000000    59.000000     2.000000     0.000000     2.000000   
75%      55.000000    84.000000     3.000000     1.000000     3.000000   
max      67.000000   205.000000     4.000000     1.000000     3.000000   

       Personal_Loan  CD_Account     Mortgage  
count         4300.0      4300.0  4300.000000  
mean             0.0         0.0    46.135116  
std              0.0         0.0    80.343277  
min              0.0         0.0     0.000000  
25%              0.0         0.0     0.000000  
50%              0.0         0.0     0.000000  
75%              0.0         0.0    91.250000  
max              0.0         0.0   359.000000  
In [218]:
# visualize original and cleaned data for each feature
fig, axes = plt.subplots(nrows=2, ncols=len(features), figsize=(15, 6))

for i, feature in enumerate(features):
    axes[0, i].hist(X[feature], bins=30, color='blue', alpha=0.5, label='Original')
    axes[0, i].set_title(feature + ' (Original)')
    axes[1, i].hist(X_cleaned[feature], bins=30, color='red', alpha=0.5, label='Cleaned')
    axes[1, i].set_title(feature + ' (Cleaned)')

plt.tight_layout()
plt.show()
  • Outliers have been masked from each feature column selected.
In [219]:
# Determine how many clusters to use
from sklearn.cluster import KMeans

# Elbow Method
inertia = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=0)
    kmeans.fit(X_cleaned)
    inertia.append(kmeans.inertia_)

#Plot graph
plt.figure(figsize=(8, 5))
plt.plot(K, inertia, 'bo-')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal k')
plt.show()
  • The elbow method suggests 3 as the optimal K value

Further analysis

  • The silhouette method can be used to confirm the elbow method's choice
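The silhouette check mentioned above can be sketched as follows. This is a minimal illustration on synthetic, well-separated blobs (`X_demo` is a made-up stand-in for `X_cleaned` so the snippet is self-contained):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Well-separated synthetic blobs stand in for X_cleaned
X_demo, _ = make_blobs(
    n_samples=500, centers=[[0, 0], [10, 10], [-10, 10]], cluster_std=1.0, random_state=42
)

# Average silhouette width for each candidate k; higher is better
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 3
```

If the silhouette peak agrees with the elbow (here k = 3), that strengthens the case for the chosen number of clusters.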
In [220]:
#Prepare the Data (X_cleaned is the cleaned dataset not loan)
X_cleaned_scaled = scaler.fit_transform(X_cleaned)  # Scale the cleaned dataset if necessary

#Elbow method showed 3
k = 3

# Apply K-means Clustering
kmeans = KMeans(n_clusters=k, random_state=42)

#Fit the Model
kmeans.fit(X_cleaned_scaled)

# Predict Clusters
cluster_labels = kmeans.predict(X_cleaned_scaled)

# Visualize Results
plt.scatter(X_cleaned_scaled[:, 0], X_cleaned_scaled[:, 1], c=cluster_labels, cmap='viridis')
plt.xlabel(features[0])
plt.ylabel(features[1])
plt.title('K-means Clustering Results')
plt.colorbar(label='Cluster')
plt.show()

Insight¶

  • The three clusters are somewhat overlapping but still distinguishable
  • The overlap indicates that there might be some similarities between the customers in different clusters, suggesting that their behavior or characteristics are not entirely distinct.
  • The yellow and purple clusters seem more similar to each other than to the teal cluster. Income appears to be the main factor separating the teal cluster from the rest.
In [221]:
# use X_cleaned (not loan, as error will occur as loan has 5000 entries)
X_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
Index: 4300 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Age            4300 non-null   int64
 1   Income         4300 non-null   int64
 2   Family         4300 non-null   int64
 3   CreditCard     4300 non-null   int64
 4   Education      4300 non-null   int64
 5   Personal_Loan  4300 non-null   int64
 6   CD_Account     4300 non-null   int64
 7   Mortgage       4300 non-null   int64
dtypes: int64(8)
memory usage: 302.3 KB
In [222]:
# Perform k-means clustering
kmeans = KMeans(n_clusters=k, random_state=42)
cluster_labels = kmeans.fit_predict(X_cleaned)

# Add cluster labels to the DataFrame (copy first to avoid a SettingWithCopyWarning on the slice)
X_cleaned = X_cleaned.copy()
X_cleaned['Cluster'] = cluster_labels

# Group by cluster labels and calculate the mean of numeric features
cluster_analysis = X_cleaned.groupby('Cluster').mean(numeric_only=True)

# Print the cluster analysis
print(cluster_analysis)
               Age     Income    Family  CreditCard  Education  Personal_Loan  \
Cluster                                                                         
0        45.508697  66.992780  2.380374    0.276666   1.852314            0.0   
1        45.281324  46.970449  2.453901    0.268322   1.943262            0.0   
2        44.565111  88.191646  2.270270    0.248157   1.724816            0.0   

         CD_Account    Mortgage  
Cluster                          
0               0.0    0.024943  
1               0.0  119.742317  
2               0.0  238.336609  

Insight¶

Each cluster represents a distinct group of customers with different income, family size, and mortgage profiles. (Note that the Z-score filter removed every customer who had a personal loan or a CD account, so those two columns are zero across all clusters.)

Interpretation of Each Cluster¶

Cluster 0: Represents individuals with moderate income, small families, low mortgage amounts, and no personal loans or CD accounts.

Cluster 1: Represents individuals with lower income compared to Cluster 0, slightly larger families, and higher mortgage amounts.

Cluster 2: Represents individuals with higher income, smaller families, and significantly higher mortgage amounts compared to the other clusters.


Data Preprocessing¶

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)
In [223]:
# Pull up data for review
loan.head()
Out[223]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard Age_Range
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0 20-29
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0 40-49
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0 30-39
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0 30-39
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1 30-39
In [224]:
# Check missing values
loan.isnull().sum()
Out[224]:
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
Age_Range             0
dtype: int64
In [225]:
# outlier detection using boxplot
numeric_columns = loan.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(loan[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Insight¶

  • There are quite a few outliers in the data, especially for Mortgage, Income, and CCAvg
  • The outliers will be kept, as they are valid data points
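The boxplots above flag points beyond the whiskers at `whis=1.5`. A small sketch of that same IQR rule as a reusable counter, demonstrated on a toy series rather than the `loan` columns:

```python
import pandas as pd

def iqr_outlier_count(s, whis=1.5):
    """Count points outside the boxplot whiskers (Q1 - whis*IQR, Q3 + whis*IQR)."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(((s < lower) | (s > upper)).sum())

# Toy demo: 99 typical values plus one extreme point
demo = pd.Series(list(range(1, 100)) + [1000])
print(iqr_outlier_count(demo))  # 1
```

Running the same counter over each numeric column of `loan` would quantify what the boxplots show visually for Mortgage, Income, and CCAvg.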

Model Building¶

In [226]:
# Personal loan as target variable
X = loan.drop(["Personal_Loan"], axis=1)
Y = loan["Personal_Loan"]

X = pd.get_dummies(X, drop_first=True)

# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)
In [227]:
# Print result
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3500, 17)
Shape of test set :  (1500, 17)
Percentage of classes in training set:
Personal_Loan
0    0.905429
1    0.094571
Name: proportion, dtype: float64
Percentage of classes in test set:
Personal_Loan
0    0.900667
1    0.099333
Name: proportion, dtype: float64
  • We had seen that around 90.5% of observations belong to class 0 (No personal loan)
  • Around 9.5% of observations belong to class 1 (Yes, personal loan), and this imbalance is preserved in the train and test sets.
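The near-identical class proportions here come from random chance; passing `stratify` to `train_test_split` guarantees them exactly. A sketch on synthetic labels (the ~10% positive rate mirrors `Personal_Loan`, but `X_demo`/`y_demo` are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (10% positives), mirroring the Personal_Loan split
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)

# Stratification keeps the 10% positive rate identical in both splits
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))  # 0.1 0.1
```

With only ~10% positive cases, stratifying avoids a split where the test set happens to receive too few loan takers to evaluate recall reliably.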

Decision Tree (default)¶

In [228]:
# Create model 0
model0 = DecisionTreeClassifier(random_state=1)
model0.fit(X_train, y_train)
Out[228]:
DecisionTreeClassifier(random_state=1)

Model Evaluation Criterion¶

Using a decision tree to predict whether customers are likely to take a personal loan, and given that personal loan datasets are often imbalanced (far fewer customers take loans than not), a combination of precision, recall, and F1-score works best, with the primary focus on recall.

  • Focusing on recall for the personal loan class ensures that most customers who would take a loan are correctly identified, minimizing the risk of missing potential loan takers while keeping false positives reasonably low.

Why Decision Tree?

  • Robust to outliers, which is helpful in this context and business problem.

Model can make wrong predictions as:

  • False Negative (FN): Predicting that a customer will not take a personal loan, but in reality, the customer does take a loan.

  • False Positive (FP): Predicting that a customer will take a personal loan, but in reality, the customer does not take a loan.

Which case is more important?

  • False Negative (FN): Predicting that a customer will not take a personal loan, but in reality, the customer takes a loan.

  • Consequence: The bank misses out on targeting potential loan customers, leading to a loss of potential revenue.

  • False Positive (FP): Predicting that a customer will take a personal loan, but in reality, the customer does not take a loan.

  • Consequence: The bank may incur additional marketing costs by targeting customers who are not interested in loans, leading to inefficient allocation of marketing resources.

How to reduce the losses?

  • Maximize Recall: To minimize the loss from missed opportunities (False Negatives), the bank should focus on maximizing recall. Greater recall increases the chances of identifying all potential loan customers, thus capturing more revenue opportunities.

  • Balance Precision and Recall: While maximizing recall, it is also important to consider precision to avoid excessive marketing costs. Therefore, a good balance between recall and precision is necessary to optimize both marketing efficiency and revenue capture.
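One practical way to balance precision and recall, as described above, is to move the decision threshold applied to `predict_proba` instead of using the default 0.5. A sketch on synthetic imbalanced data (a stand-in for the loan set, not the notebook's actual split):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# Synthetic ~90/10 data standing in for the loan set
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, stratify=y)

clf = DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=1)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
# Lowering the threshold below 0.5 trades precision for extra recall
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    print(threshold, round(recall_score(y_te, pred), 2))
```

Lowering the threshold can only add positive predictions, so recall never drops; the cost shows up as lower precision, which is exactly the trade-off the bank has to price against its marketing budget.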

Model Functions¶

In [229]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [230]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Model Building¶

In [231]:
confusion_matrix_sklearn(model0, X_train, y_train)
In [232]:
decision_tree_perf_train_without = model_performance_classification_sklearn(
    model0, X_train, y_train
)
decision_tree_perf_train_without
Out[232]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
  • Model is able to perfectly classify all the data points on the training set.
  • 0 errors on the training set, each sample has been classified correctly.
In [233]:
confusion_matrix_sklearn(model0, X_test, y_test)
In [234]:
decision_tree_perf_test_without = model_performance_classification_sklearn(
    model0, X_test, y_test
)
decision_tree_perf_test_without
Out[234]:
Accuracy Recall Precision F1
0 0.978 0.90604 0.876623 0.891089

Insight¶

  • The decision tree model's results are promising, showing high accuracy, recall, precision, and F1 score.

  • Precision score had the biggest change from 1.0 to 0.87.

  • Accuracy remains the highest followed by recall.

Model Performance Improvement¶

Decision Tree with Weights (Class_weights / Balanced Approach)¶

  • If the frequency of class A is 10% and the frequency of class B is 90%, then class B will become the dominant class and the decision tree will become biased toward the dominant classes

  • In this case, we will set class_weight = "balanced", which will automatically adjust the weights to be inversely proportional to the class frequencies in the input data

  • class_weight is a hyperparameter for the decision tree classifier
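The balanced weights can be computed explicitly to see what the classifier uses internally: each class gets `n_samples / (n_classes * class_count)`. A small sketch on a made-up 90/10 label vector:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 90/10 labels like the loan data: weight_c = n_samples / (n_classes * count_c)
y_demo = np.array([0] * 90 + [1] * 10)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)
print(weights)  # minority class gets 9x the weight of the majority class
```

Here class 0 receives 100/(2·90) ≈ 0.56 and class 1 receives 100/(2·10) = 5.0, so each minority-class sample counts nine times as much during tree building.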

In [235]:
# Create model 1 with weight class
model = DecisionTreeClassifier(random_state=1, class_weight="balanced")
model.fit(X_train, y_train)
Out[235]:
DecisionTreeClassifier(class_weight='balanced', random_state=1)
In [236]:
confusion_matrix_sklearn(model, X_train, y_train)
In [237]:
decision_tree_perf_train = model_performance_classification_sklearn(
    model, X_train, y_train
)
decision_tree_perf_train
Out[237]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
  • Model is able to perfectly classify all the data points on the training set.
  • 0 errors on the training set, each sample has been classified correctly.
  • As we know a decision tree will continue to grow and classify each data point correctly if no restrictions are applied as the trees will learn all the patterns in the training set.
  • This generally leads to overfitting of the model as Decision Tree will perform well on the training set but will fail to replicate the performance on the test set.
In [238]:
confusion_matrix_sklearn(model, X_train, y_train)
In [239]:
decision_tree_perf_test = model_performance_classification_sklearn(
    model, X_test, y_test
)
decision_tree_perf_test
Out[239]:
Accuracy Recall Precision F1
0 0.977333 0.872483 0.896552 0.884354

Insight¶

  • The balanced class weights lead to a somewhat more general fit, causing small changes

  • Accuracy, recall, precision, and F1 are lower on the test set than on the training set but still remain high.

  • The gap might suggest overfitting


Decision Tree Pre-Pruning Approach¶

Gridsearch for Hyperparameter tuning¶

Reason for use?

  • In the context of determining personal loans, hyperparameter tuning with grid search is crucial for optimizing the decision tree model's performance.

  • It maximizes the model's ability to correctly predict whether a customer will accept a personal loan.

In [240]:
# Choose classifier
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    "class_weight": [None, "balanced"],
    "max_depth": np.arange(2, 7, 2), # [2, 4, 6]
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}

# Recall scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
Out[240]:
DecisionTreeClassifier(class_weight='balanced', max_depth=2, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)
In [241]:
confusion_matrix_sklearn(estimator, X_train, y_train)
In [242]:
decision_tree_tune_perf_train = model_performance_classification_sklearn(
    estimator, X_train, y_train
)
decision_tree_tune_perf_train
Out[242]:
Accuracy Recall Precision F1
0 0.790286 1.0 0.310798 0.474212
  • The model's F1 and precision have decreased considerably compared to the class-weight model

  • Recall remains very high

In [243]:
confusion_matrix_sklearn(estimator, X_test, y_test)
In [244]:
decision_tree_tune_perf_test = model_performance_classification_sklearn(
    estimator, X_test, y_test
)
decision_tree_tune_perf_test
Out[244]:
Accuracy Recall Precision F1
0 0.779333 1.0 0.310417 0.473768
  • The model still achieves a perfect recall score of 1.0 on both the training and test sets, which shows it identifies every actual loan taker even on unseen data.

  • Accuracy, precision, and F1 barely change between the train and test sets

In [245]:
# Get features from model
feature_names = list(X_train.columns)
importances = estimator.feature_importances_
indices = np.argsort(importances)
In [246]:
# Visualize decision tree branches
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [247]:
# Text report showing the rules of a decision tree
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1344.67, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- weights: [64.61, 79.31] class: 1
|--- Income >  92.50
|   |--- Education <= 1.50
|   |   |--- weights: [272.80, 306.65] class: 1
|   |--- Education >  1.50
|   |   |--- weights: [67.92, 1364.05] class: 1

Insight from Pre Pruning Approach¶

Summary: High-Income, High Education: Most promising group for personal loans and should be prioritized for targeted marketing.


Income and Credit Card Spending (CCAvg): Income ≤ 92.50:

  • CCAvg ≤ 2.95: Customers in this group are very unlikely to take personal loans.
  • CCAvg > 2.95: Customers with higher credit card spending are more likely to take personal loans.

Income > 92.50:

  • Education ≤ 1.50: Customers with lower education levels still show a tendency towards taking personal loans.
  • Education > 1.50: Customers with higher education levels are very likely to take personal loans.
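The exported rules above are simple enough to transcribe by hand. The helper below is a hypothetical illustration of those four leaf rules, not the fitted estimator itself:

```python
def likely_loan_taker(income, ccavg, education):
    """Hand transcription of the pruned tree's leaves (illustrative, not the fitted model)."""
    if income <= 92.5:
        # Low income: only high credit-card spenders lean toward a loan
        return ccavg > 2.95
    # High income: both education branches were labeled class 1
    return True

# High income -> likely; low income with low spending -> unlikely
print(likely_loan_taker(120, 1.0, 3), likely_loan_taker(60, 1.5, 1))  # True False
```

Reading the tree this way makes the targeting rule concrete: prioritize customers earning above ~92.5K, plus lower-income customers whose monthly card spend exceeds ~2.95K.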
In [248]:
# Get feature importances from model
importances = estimator.feature_importances_
importances
Out[248]:
array([0.        , 0.        , 0.        , 0.82007181, 0.        ,
       0.        , 0.06262835, 0.11729984, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        ])
In [249]:
# Importance of features in the tree building
importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Insight from Feature Importance¶

  • Income is by far the most important feature (relative importance ≈ 0.82), followed by Education (≈ 0.12) and CCAvg (≈ 0.06).

Decision Tree Post Pruning¶

Reason for use?

  • To prevent overfitting
  • To compare performance against the default and pre-pruned decision trees and see which model performs best
In [250]:
# total impurity of the leaves in a decision tree
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced") # {0: 0.15, 1: 0.85}
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
In [251]:
pd.DataFrame(path)
Out[251]:
ccp_alphas impurities
0 0.000000e+00 -7.832774e-15
1 3.853725e-19 -7.832388e-15
2 4.729571e-19 -7.831915e-15
3 4.729571e-19 -7.831442e-15
4 7.707449e-19 -7.830671e-15
5 1.051016e-18 -7.829620e-15
6 1.261219e-18 -7.828359e-15
7 8.338059e-18 -7.820021e-15
8 1.257806e-17 -7.807443e-15
9 1.574681e-04 3.149363e-04
10 2.857143e-04 8.863649e-04
11 3.083987e-04 1.503162e-03
12 3.116508e-04 2.438115e-03
13 3.130853e-04 3.064285e-03
14 3.604006e-04 4.505887e-03
15 3.628623e-04 5.957336e-03
16 3.797005e-04 7.476138e-03
17 5.220569e-04 7.998195e-03
18 5.375794e-04 8.535775e-03
19 5.880239e-04 9.711822e-03
20 7.689471e-04 1.048077e-02
21 1.003878e-03 1.148465e-02
22 1.213013e-03 1.391067e-02
23 1.343845e-03 1.525452e-02
24 1.416204e-03 1.667072e-02
25 1.431094e-03 1.953291e-02
26 1.693744e-03 2.292040e-02
27 1.981730e-03 2.688386e-02
28 2.150414e-03 2.903427e-02
29 2.375809e-03 3.141008e-02
30 3.344493e-03 3.475457e-02
31 3.602932e-03 4.196044e-02
32 3.729690e-03 4.569013e-02
33 4.920880e-03 5.061101e-02
34 1.007808e-02 7.076717e-02
35 2.255792e-02 9.332509e-02
36 5.564782e-02 2.046207e-01
37 2.953793e-01 5.000000e-01
In [252]:
# Figure size
fig, ax = plt.subplots(figsize=(10, 5))

# PLot graph
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Reason for use

  • Displays the relationship between the effective alpha values and the total impurity of the leaves in the decision tree
  • Helps make an informed decision on how much to prune the tree to achieve optimal performance
In [253]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.2953792759992323
In [254]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
In [255]:
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
In [256]:
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
In [257]:
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
In [258]:
# Compare alphas & Recall - train & test
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
    ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()

Insight¶

  • These lines show how the recall scores change with different values of alpha for both datasets.

  • Both train and test recall are high and track each other closely, indicating the model generalizes well to unseen data.

In [259]:
# selecting the model with the highest test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.002375808619774645, class_weight='balanced',
                       random_state=1)
In [260]:
# Confusion matrix for the best model on training data
confusion_matrix_sklearn(best_model, X_train, y_train)
In [261]:
decision_tree_post_perf_train = model_performance_classification_sklearn(
    best_model, X_train, y_train
)
decision_tree_post_perf_train
Out[261]:
Accuracy Recall Precision F1
0 0.956857 1.0 0.686722 0.814268
  • Accuracy and Recall are very high; Precision and F1 increased compared to the pre-pruned model.
In [262]:
confusion_matrix_sklearn(best_model, X_test, y_test)
In [263]:
decision_tree_post_test = model_performance_classification_sklearn(
    best_model, X_test, y_test
)
decision_tree_post_test
Out[263]:
Accuracy Recall Precision F1
0 0.948667 0.993289 0.660714 0.793566

Insight¶

  • Recall remains very high at 0.99.
  • Accuracy improved from 0.78 (pre-pruning) to 0.95.
  • Precision and F1 decreased slightly relative to the training set.
  • This indicates the approach is dealing well with unseen data.
In [264]:
# Visualize decision tree branches
plt.figure(figsize=(20, 10))

out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [265]:
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1344.67, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- weights: [41.42, 52.87] class: 1
|   |   |   |--- CCAvg >  3.95
|   |   |   |   |--- weights: [23.19, 0.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- weights: [0.00, 26.44] class: 1
|--- Income >  92.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 103.50
|   |   |   |   |--- CCAvg <= 3.21
|   |   |   |   |   |--- weights: [22.09, 0.00] class: 0
|   |   |   |   |--- CCAvg >  3.21
|   |   |   |   |   |--- weights: [2.76, 15.86] class: 1
|   |   |   |--- Income >  103.50
|   |   |   |   |--- weights: [239.11, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [8.84, 290.79] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 116.50
|   |   |   |--- CCAvg <= 2.85
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [37.55, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- weights: [23.75, 37.01] class: 1
|   |   |   |--- CCAvg >  2.85
|   |   |   |   |--- weights: [6.63, 153.32] class: 1
|   |   |--- Income >  116.50
|   |   |   |--- weights: [0.00, 1173.72] class: 1

Insight¶

Income and Credit Card Average Spend (CCAvg):

  • Individuals with lower income and lower credit card average spend (CCAvg) are less likely to take out personal loans.

  • Among lower-income customers, those with higher CCAvg (above ~2.95) are more likely to take out personal loans, especially if they also hold a CD account.

Education and Family Size:

  • Among higher-income customers (above ~92K) with lower education levels, those with larger families (3+) are much more likely to take out personal loans.

  • Higher-income individuals with higher education levels are also very likely to take out personal loans: above ~116K income nearly all did, and in the 92K–116K range a moderate-to-high CCAvg tips the prediction toward a loan.
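The strongest class-1 leaves above can be read off as plain predicates. A hedged sketch (the helper name is illustrative and not part of the notebook; the thresholds are taken from the printed rules):

```python
# Illustrative helper (not in the notebook): the two purest class-1 leaves
# of the pruned tree, rewritten as a plain predicate. Thresholds come from
# the export_text output above.
def likely_loan_taker(income, education, family):
    # Income > 116.5 with graduate+ education: the purest "yes" leaf
    if income > 116.5 and education > 1.5:
        return True
    # Income > 92.5, undergrad education, family of 3+: a strong "yes" leaf
    if income > 92.5 and education <= 1.5 and family > 2.5:
        return True
    return False

print(likely_loan_taker(income=120, education=3, family=2))  # True
print(likely_loan_taker(income=50, education=1, family=2))   # False
```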

In [266]:
# Importance of features in the tree building
importances = best_model.feature_importances_
indices = np.argsort(importances)
In [267]:
# Importance of features in the tree building
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Insight from Feature Importance¶

  • Income remains the most important feature in both the pre- and post-pruned decision tree models.

  • The post-pruned tree draws on more features: Income, Family, Education, CCAvg, and CD_Account.
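Impurity-based importances (used above) can overstate features the tree happens to split on. A cross-check sketch, not in the notebook, using scikit-learn's permutation importance on a synthetic dataset:

```python
# Cross-check sketch: permutation importance measures how much shuffling
# each feature hurts the score, rather than how often the tree split on it.
# Synthetic data stands in for the notebook's X_train/y_train.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=1)
tree_model = DecisionTreeClassifier(random_state=1, ccp_alpha=0.01).fit(X, y)
result = permutation_importance(tree_model, X, y, n_repeats=10, random_state=1)

# Rank features from most to least important by mean importance drop
order = np.argsort(result.importances_mean)[::-1]
print(order)
```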


Model Comparison and Final Model Selection¶

Compare the training performance and testing models for all models built (default, weight class, pre-pruning and post pruning)¶

In [268]:
# Training performance comparison
models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train_without.T,
        decision_tree_perf_train.T,
        decision_tree_tune_perf_train.T,
        decision_tree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree without class_weight",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[268]:
Decision Tree without class_weight Decision Tree with class_weight Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 1.0 1.0 0.790286 0.956857
Recall 1.0 1.0 1.000000 1.000000
Precision 1.0 1.0 0.310798 0.686722
F1 1.0 1.0 0.474212 0.814268
In [269]:
# Testing performance comparison
models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test_without.T,
        decision_tree_perf_test.T,
        decision_tree_tune_perf_test.T,
        decision_tree_post_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree without class_weight",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
Out[269]:
Decision Tree without class_weight Decision Tree with class_weight Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 0.978000 0.977333 0.779333 0.948667
Recall 0.906040 0.872483 1.000000 0.993289
Precision 0.876623 0.896552 0.310417 0.660714
F1 0.891089 0.884354 0.473768 0.793566

Insight from model comparison¶

  • Overall, the decision tree with post-pruning seems to strike the best balance between model complexity and generalization performance, achieving high accuracy and recall on both the training and test sets.

  • By focusing on recall, especially for predicting personal loan uptake, post-pruning ensures that most customers who would choose to take a loan are correctly identified, minimizing the risk of missing potential loan takers while keeping false positives reasonably low.


Conclusion Summary¶

Analyzed the dataset to uncover patterns and insights, focusing on identifying key indicators of whether a customer is likely to take a personal loan in the next campaign. Employed standard Exploratory Data Analysis (EDA) techniques, followed by K-means clustering and decision tree models to pinpoint characteristics of potential customers.

Actionable Insights and Business Recommendations¶

  • Overall, the majority of AllLife Bank customers in this dataset did not accept a personal loan in the last campaign.

  • The most important features for predicting whether someone will accept the personal loan are Income, Education, Family, and CCAvg.

  • There are potential customers the bank can target to increase their conversion rate on the next campaign. An ideal customer to target would be:

    • Higher-income individuals: $98K to $200K
    • Highly educated individuals: Level 3 (Advanced/Professional)
    • Higher CCAvg spenders: $3K+ per month
    • Larger families: 3+ members
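The profile above translates directly into a pandas filter. A minimal sketch using a tiny synthetic frame; in practice the notebook's real DataFrame, with the same column names, would be used:

```python
# Sketch: size the target segment with a pandas filter. The frame here is
# synthetic; only the column names match the notebook's data.
import pandas as pd

df = pd.DataFrame({
    "Income":    [45, 120, 150, 98, 200],
    "Education": [1, 3, 3, 2, 3],
    "CCAvg":     [1.0, 3.5, 4.2, 2.0, 5.0],
    "Family":    [2, 3, 4, 1, 3],
})
target = df[
    df["Income"].between(98, 200)
    & (df["Education"] == 3)
    & (df["CCAvg"] >= 3)
    & (df["Family"] >= 3)
]
print(len(target))  # 3 of the 5 synthetic rows match the ideal profile
```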

Consider the following:

  • The majority of customers (70%) use online facilities. The personal loan application process should be made equally easy to complete online once customers are targeted.
  • The highest-count ZIP codes were 94720 (Berkeley, CA) with 169 customers and 94305 (Stanford, CA) with 127. Prioritizing these locations takes advantage of high interaction with potential customers and can maximize campaign results.

Further Analysis & Data¶

  • ZIP codes provide valuable insight into location and economic averages for customers in those areas. Analyzing the different locations can further segment customers, allowing the bank to better target them in future campaigns.

  • More data on online services (when, why, and for how long customers use them) could be used to discover customer pain points.

  • Customer tenure (the year they joined the bank): knowing who is most likely to take out a loan based on how long they have been a customer could be helpful for future campaigns.


Marketing Team Recommendations¶

Summary Strategy

By understanding the unique characteristics and preferences of each cluster, organizations can develop targeted marketing strategies that resonate with their audience, drive engagement, and ultimately lead to increased customer satisfaction and loyalty.

Marketing Strategies for Each Cluster

Cluster 0¶

  • Targeted Messaging: Craft messaging that emphasizes financial stability and responsible spending.

  • Product Offerings: Offer low-interest credit cards or savings accounts to encourage savings and responsible credit card usage. Provide educational resources on financial planning and budgeting to support their financial goals.

  • Promotions: Offer promotions or rewards for opening new savings accounts or credit card accounts.

  • Personalization: Use personalized marketing campaigns based on their specific financial needs and preferences.

Cluster 1¶

  • Financial Education: Provide resources and workshops on managing finances, especially focusing on budgeting and debt management.

  • Mortgage Services: Offer mortgage refinancing options or home equity loans with attractive rates to assist with homeownership goals.

  • Credit Building Products: Provide credit-building products or services to help improve credit scores and qualify for better financial opportunities.

  • Family-Oriented Promotions: Create family-oriented promotions or events to appeal to their slightly larger family size.

Cluster 2¶

  • Affluent Lifestyle: Highlight exclusive or premium banking services, such as concierge banking or wealth management services, tailored to their higher income level.

  • Investment Opportunities: Offer investment products or portfolio management services to grow their wealth.

  • Luxury Rewards: Provide luxury rewards or perks for high-value clients, such as exclusive access to events or travel benefits.

  • Personalized Financial Planning: Offer personalized financial planning services to help them achieve their long-term financial goals.